Abstract

This study presents a novel genome-wide genetics platform that permits functional interrogation of all point mutations across a viral genome in parallel. We generated the first fitness profile of individual point mutations across the influenza virus genome. Critical residues in the viral genome were systematically identified, providing a collection of subdomain data informative for structure-function studies and for rational drug and vaccine design. Our data were consistent with known, well-characterized structural features. In addition, we achieved a validation rate of 68% for severely attenuated mutations and 94% for neutral mutations. The approach described in this study is applicable to other viral or microbial genomes for which a means of genetic manipulation is available.

Introduction

The influenza virus causes several hundred thousand deaths every year, and this number can reach millions in pandemic years. The huge socioeconomic burden associated with influenza highlights the importance of understanding virus-host interactions [1,2]. The rapidly evolving nature of influenza challenges the development of anti-influenza drugs and vaccines [3-7]. Consequently, it is important to develop drugs or vaccines that target indispensable regions of the influenza virus to maximize the genetic barrier to the emergence of resistance or escape mutations. Nevertheless, genetic research on the influenza virus has largely relied on naturally occurring variants and individual mutants created in the laboratory. A substantial part of the genome remains uncharacterized.

Traditional genetics studies one genotype-phenotype relationship at a time, and has been used extensively to study panels of influenza mutations. However, the low throughput of traditional genetics limits the number of mutations that can be examined. In contrast, high-throughput genetics interrogates the phenotypic outcomes of multiple mutants in parallel. Genome-wide insertional mutagenesis is a common high-throughput genetics approach. It has been employed in the influenza virus to systematically identify regions that are tolerant to mutations [8]. However, the resolution of the insertion-based approach is limited to the protein subdomain level. This resolution is insufficient to identify residues critical for replication. As a result, there is a demand for a high-throughput genetics platform with single-residue resolution.

Recently, we developed a high-throughput genetic platform that allowed us to profile the fitness effect of individual point mutations across the influenza A virus hemagglutinin (HA) segment [9]. The principle of the platform is to couple a large mutant library with deep sequencing. Here, we extended this approach to quantify the fitness effects of each point mutation in 96% of the influenza A virus genome. This technique will enable systematic identification of indispensable regions as drug or vaccine targets. More importantly, it can be applied to any specified growth conditions for any virus that can be genetically manipulated.

Results

Quantification of the fitness effect of individual point mutations

Our high-throughput genetics platform aims to randomly mutagenize each nucleotide of the genome and to monitor the changes in occurrence frequency of individual point mutations under specified growth conditions using massive deep sequencing [9]. The changes in occurrence frequency of each point mutation (such as depletion or enrichment) allow us to quantify the mutational fitness outcomes under the given growth conditions. The mutant libraries were created by error-prone PCR on the eight-plasmid reverse genetics system of influenza A/WSN/1933 (H1N1) [10] (see Materials and Methods). Subsequently, eight viral mutant libraries were generated by transfection, each with one of the eight segments mutagenized. All viral mutant libraries were passaged for two 24-hour rounds in A549 cells (human lung epithelial carcinoma cells). The plasmid library and the passaged viral library were each sequenced on an Illumina HiSeq 2000. Here, a relative fitness index (RF index) is used to estimate the mutational fitness effect. The RF index is calculated as:

RF index = (occurrence frequency in passaged library) / (occurrence frequency in plasmid library)
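
As a minimal sketch of the ratio above, the following Python function computes an RF index from raw counts. The function name, the argument names, and the example counts are illustrative assumptions and are not taken from the study's pipeline.

```python
def rf_index(passaged_count, passaged_total, plasmid_count, plasmid_total):
    """Return the RF index of one point mutation.

    Each occurrence frequency is taken as (reads carrying the mutation) /
    (reads covering the position). All argument names are hypothetical.
    """
    if passaged_total <= 0 or plasmid_total <= 0:
        raise ValueError("library totals must be positive")
    if plasmid_count <= 0:
        raise ValueError("mutation must be observed in the plasmid library")
    passaged_freq = passaged_count / passaged_total
    plasmid_freq = plasmid_count / plasmid_total
    return passaged_freq / plasmid_freq


# Made-up counts: the mutation drops from 50 to 5 per 100,000 reads.
example = rf_index(5, 100_000, 50, 100_000)
```

By construction, an RF index below 1 indicates depletion of the mutation during passaging, and an RF index above 1 indicates enrichment.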

The occurrence frequency of individual mutations was expected to be lower than the sequencing error rate (~0.1%-1%) of next-generation sequencing (NGS). Therefore, we used a two-step PCR approach for sequencing library preparation to distinguish true mutations from sequencing errors. In the first PCR, a unique tag was assigned to each molecule. The second PCR generated multiple identical copies of each tagged molecule. The input copy number for the second PCR was controlled such that each tagged molecule would be sequenced ~10 times. True mutations would exist in all sequencing reads sharing the same tag, whereas sequencing errors would not. Individual molecules, each carrying a unique tag, had an average copy number of ~10 in the sequencing data, which validated the sequencing library preparation design (Fig. S1).

Point mutation fitness profiling of influenza A virus genome

The RF indices for individual point mutations were profiled across 96% of nucleotide positions in the influenza A/WSN/1933 virus genome (Fig. 1). The remaining 4% of nucleotides, located at the termini of each gene segment, were not covered due to PCR amplification difficulty. As expected, a positive correlation exists between the RF index and the degree of amino acid conservation of missense mutations (Fig. S2). In addition, the fitness data for well-characterized mutants were consistent with their phenotypes reported in the literature. Examples include a critical salt bridge for viral replication on nucleoprotein (NP) [11] (Fig. S3A), a replication enhancement mutation on polymerase subunit PB2 [12] (Fig. S3B), attenuation caused by an oseltamivir resistance mutation on neuraminidase (NA) [13] (Fig. S3C), the low fitness cost of amantadine/rimantadine resistance mutations on the ion channel (M2) [5,14,15] (Fig. S3D), and the basic stretch on matrix protein (M1) required for assembly [16] (Fig. S4). Furthermore, a comparison between our fitness data and the previously reported polymerase activity of 19 PB1 mutants showed an 80% correlation [17]. Mutants that displayed a severely attenuated (RF index < 0.05) or neutral (RF index > 0.4) phenotype were randomly selected across the genome, individually constructed, and tested. The replication phenotype of each single mutant validated the profiling data with a confirmation rate of 68% for severely attenuated mutations and 94% for neutral mutations (Fig. 2). Taken together, these data support the validity of our fitness profiling data set.
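
The cutoffs used to pick validation candidates can be sketched in Python as follows. The two thresholds come from the text; the function names, the "intermediate" label for values between the cutoffs, and the example outcomes are assumptions made for illustration only.

```python
SEVERE_CUTOFF = 0.05   # RF index < 0.05: severely attenuated (from the text)
NEUTRAL_CUTOFF = 0.4   # RF index > 0.4: neutral (from the text)


def classify_rf(rf):
    """Assign a predicted phenotype class to one RF index."""
    if rf < SEVERE_CUTOFF:
        return "severely attenuated"
    if rf > NEUTRAL_CUTOFF:
        return "neutral"
    return "intermediate"  # label assumed here; the text does not name this class


def confirmation_rate(outcomes):
    """Fraction of tested mutants whose measured phenotype matched the prediction.

    `outcomes` is a list of booleans, one per individually tested mutant.
    """
    if not outcomes:
        raise ValueError("no validation outcomes supplied")
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes for four tested mutants (not data from the study).
rate = confirmation_rate([True, True, False, True])
```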

Structural analysis and identification of indispensable protein surface

Our high-throughput profiling technique provides a basis for identifying essential protein surfaces for drug targeting and indispensable regions for vaccine epitopes. We performed a structural analysis of NA, a major influenza vaccine antigen. We identified a cluster of essential residues at the tetramer formation interface, suggesting that it bears functional importance and could be a drug target site. In contrast, no such large cluster of essential residues was found on any other part of the NA surface. The lack of essential residues on the NA surface explains the functional basis of antigenic drift.

We also performed a structural analysis of the PA subunit of the influenza virus RNA polymerase as an example of searching for indispensable regions to aid rational drug design. Increasing evidence suggests that PA is a valuable target for drug development due to its polyfunctionality [18-20]. Our fitness data provided an informative reference for rational drug design. They captured several critical interactions between PA and PB1, such as the hydrogen bond between PA E617 and PB1 K11 (Fig. 3A), and the hydrophobic interaction between PA and PB1 via the volume-filling residues L666 and F710 (Fig. 3B). They also revealed a cluster of essential residues on the PA surface consisting of eight amino acids (Fig. 3C), including K539 and K574, which were previously shown to be part of a lead compound binding pocket [19]. This patch of amino acids may be involved in a protein-protein interaction essential for viral replication. Similar analyses using our dataset have been applied to the PA endonuclease domain and the M2 ion channel, which are plausible targets in drug development (Fig. S5-6). Projecting the fitness profiling data onto three-dimensional protein structures enables the identification of novel putative essential structural motifs that are surface exposed but not necessarily contiguous in the primary sequence. This type of analysis reveals biological targets useful for rational drug and vaccine design. We propose that future antiviral drug design can combine the technique described in this study with in silico drug screening to increase the efficiency of therapeutic identification.
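
The per-residue scoring used for the structure projections (the RF index of the least destructive missense mutation at each amino acid, colored with the cutoffs given in the figure legends) can be sketched as below. The function names, the integer residue keys, the "none" label for residues at or above 0.2, and the example values are hypothetical and serve only to illustrate the idea.

```python
def least_destructive_rf(missense_rf_by_residue):
    """Map each residue to the highest RF index among its covered missense mutations.

    Residues with no covered missense mutation are omitted from the result.
    """
    return {
        residue: max(values)
        for residue, values in missense_rf_by_residue.items()
        if values
    }


def color_for(residue, scores):
    """Color a residue using the cutoffs quoted in the figure legends."""
    rf = scores.get(residue)
    if rf is None:
        return "grey"    # uncovered
    if rf < 0.1:
        return "red"
    if rf < 0.2:
        return "orange"
    return "none"        # assumed: residues at or above 0.2 are left uncolored


# Made-up RF indices keyed by residue position (not data from the study).
scores = least_destructive_rf({10: [0.02, 0.08], 11: [0.5, 0.03], 12: [], 13: [0.15]})
colors = [color_for(r, scores) for r in (10, 11, 12, 13)]
```

Taking the maximum per residue means that a residue is flagged as essential only when every covered substitution at that position is deleterious.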

Discussion

Sequence conservation has often been taken as the sole parameter for identifying residues essential for viral replication, although conservation is not equivalent to essentiality. It has been suggested that a significant fraction of the residues that are conserved in the influenza A virus are dispensable for viral replication [17,21,22]. In addition, new mutations are observed in every flu season, implying that residues that are currently conserved in nature may still be able to mutate under future unforeseen selection pressures. Therefore, high-throughput fitness profiling compensates for the shortcomings of sequence conservation analysis and allows identification of amino acid residues that are critical for viral replication in a defined cellular environment.

Here we provided a proof-of-concept study profiling the entire influenza A virus genome at single-nucleotide resolution. The fitness effects of individual point mutations were interrogated in a high-throughput manner by coupling a large mutant library with NGS. The quantitative accuracy of our platform can be further improved as sequencing technology advances. Similar experiments should be performed with strains across subtypes to identify mutations that display a genetic background-dependent fitness effect. These results would provide valuable information for dissecting the evolutionary process of the influenza A virus. In addition, this platform can be applied to study virus-host interactions under different cellular responses (such as apoptosis, autophagy, inflammasome induction, ER stress, etc.) and immune responses (such as NK cells, T cells, antibodies, macrophages, cytokines, etc.) that influence viral replication in nature [23,24]. Such results will significantly improve our understanding of the biological role of each residue in the genome of the influenza A virus. They will also help improve the design of live attenuated influenza vaccines by minimizing virulence. More importantly, the platform can potentially be adapted to other viruses and microbes that can be genetically manipulated in the laboratory.

Materials and Methods

Viral mutant library and point mutations

The plasmid mutant libraries were created by performing error-prone PCR on the eight-plasmid reverse genetics system of influenza A/WSN/1933 (H1N1) [10]. We PCR-amplified the flu insert with the error-prone polymerase Mutazyme II (Stratagene, La Jolla, CA). The mutation rate of the error-prone PCR was optimized by adjusting the input template amount to avoid the accumulation of deleterious mutations. The restriction enzyme sites BsmBI and/or BsaI were added to the PCR primers and used to clone the insert into a BsmBI-digested parental vector, pHW2000. Ligations were carried out with high-concentration T4 ligase (Invitrogen, Grand Island, NY). Transformations were carried out with electrocompetent MegaX DH10B T1R cells (Invitrogen), and > 100,000 colonies for each segment library were scraped and directly processed for plasmid DNA purification (Qiagen Sciences, Germantown, MD). As extensive trans-complementation was expected during the transfection step, > 35 million cells were used for transfection to average out any bias or artifact generated by possible trans-complementation. Point mutants for the validation experiment were constructed using the QuikChange XL Mutagenesis kit (Stratagene) according to the manufacturer's instructions.

Transfections, infections, and titering

C227 cells, a cell line derived from human embryonic kidney (293T) cells that stably expresses a dominant-negative IRF-3, were transfected with Lipofectamine 2000 (Invitrogen) using 7 wild-type plasmids plus 1 mutant (library) plasmid. Supernatant was replaced with fresh cell growth medium at 24 hrs and 48 hrs post-transfection. At 72 hrs post-transfection, supernatant containing infectious virus was harvested, filtered through a 0.45 µm MCE filter, and stored at -80°C. The TCID50 was measured on A549 cells (human lung carcinoma cells).

Virus from the C227 transfection was used to infect A549 cells at an MOI of 0.05. Infected cells were washed three times with PBS, followed by the addition of fresh cell growth medium at 2 hrs post-infection. Virus was harvested at 24 hrs post-infection. For the mutant library profiling, all viral mutant libraries were passaged for two 24-hour rounds in A549 cells. Our pilot experiments as well as our previous study revealed that two rounds of passaging were sufficient for profiling [25].

Sequencing library preparation

DNA from the plasmid library or cDNA from the passaged viral mutant library was amplified with forward and reverse primers, each flanked with a 6 "N" tag and the flow cell adapter region. Flanking region for 5' primer: 5'-CTACACGACGCTCTTCCGATCTNNNNNN-3'. Flanking region for 3' primer: 5'-TGCTGAACCGCTCTTCCGATCTNNNNNN-3'. Following PCR, 93 amplicon products were pooled together. 15 million copies of the pooled product were used as the input for the second PCR, which was equivalent to 10 paired-end reads per molecule if 150 million paired-end reads (approximately one lane on an Illumina HiSeq 2000 machine) were sequenced. 5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCG-3' and 5'-CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCG-3' were used as the primers for the second PCR. Products from the second PCR were submitted for NGS. The error-correction technique described in this study adapted the approach described for detecting rare mutations in human cells [26]. Raw sequencing data have been submitted to the NIH Short Read Archive under accession numbers SRR1042008 (plasmid mutant library) and SRR1042006 (passaged mutant library).
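
The depth arithmetic behind the input copy number can be written out as a short Python sketch. It assumes, as a simplification, that every input molecule is amplified and sampled evenly; the function names are hypothetical.

```python
def reads_per_molecule(total_reads, input_molecules):
    """Expected number of reads per tagged molecule under even sampling."""
    if input_molecules <= 0:
        raise ValueError("input_molecules must be positive")
    return total_reads / input_molecules


def input_copies_for_depth(total_reads, target_depth):
    """Input copy number that yields the target per-molecule depth."""
    if target_depth <= 0:
        raise ValueError("target_depth must be positive")
    return total_reads // target_depth


# Numbers from the text: one lane of ~150 million paired-end reads,
# 15 million input copies for the second PCR.
depth = reads_per_molecule(150_000_000, 15_000_000)
copies = input_copies_for_depth(150_000_000, 10)
```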

Data Analysis

Sequencing reads were mapped by BWA with a maximum of six mismatches and no gaps [27]. Reads sharing the same tag were collected to generate a read cluster. Since each read cluster originated from the same template, true mutations were called only if the mutations occurred in 90% of the reads within a read cluster. Read clusters with a size below three reads were filtered out. Read clusters were further conflated into "error-free" reads. The relative fitness index (RF index) for individual point mutations was computed as:

RF index = (occurrence frequency in passaged library) / (occurrence frequency in plasmid library)

For all downstream analyses, only point mutations covered by ≥ 30 tag-conflated reads ("error-free" reads) in the plasmid library were included. This arbitrary cutoff filtered out mutants with low statistical confidence.
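
The tag-based error correction described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the study's actual code: reads are assumed to be already mapped and reduced to a set of mutation labels (the "A5G" notation is hypothetical), the 90% rule is interpreted as "at least 90% of the reads in the cluster", and all function and variable names are invented for the example.

```python
from collections import Counter, defaultdict

MIN_CLUSTER_SIZE = 3     # clusters with fewer than three reads are discarded
MIN_READ_FRACTION = 0.9  # a mutation must appear in 90% of the cluster's reads


def call_true_mutations(cluster_reads):
    """Return mutations supported by the cluster, or None if it is too small.

    `cluster_reads` is a list of sets, one set of mutation labels per read.
    """
    n = len(cluster_reads)
    if n < MIN_CLUSTER_SIZE:
        return None
    counts = Counter(m for read in cluster_reads for m in set(read))
    return {m for m, c in counts.items() if c / n >= MIN_READ_FRACTION}


def conflate(tagged_reads):
    """Group reads by (amplicon, tag) and conflate each cluster into one
    "error-free" read, represented here by its set of called mutations."""
    clusters = defaultdict(list)
    for amplicon, tag, mutations in tagged_reads:
        clusters[(amplicon, tag)].append(mutations)
    error_free = {}
    for key, reads in clusters.items():
        called = call_true_mutations(reads)
        if called is not None:
            error_free[key] = called
    return error_free


# Made-up example: ten reads share one tag, two reads share another.
reads = [("amp1", "AAACCC", {"A5G"})] * 9
reads.append(("amp1", "AAACCC", {"A5G", "C7T"}))  # C7T appears in 1 of 10 reads
reads += [("amp1", "GGGTTT", {"G9A"})] * 2        # cluster of two reads
result = conflate(reads)
```

In this example, A5G is kept because every read in its cluster carries it, C7T is discarded as a likely sequencing error, and the two-read cluster is dropped entirely.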
Figure Legends

Figure 1. Single-nucleotide resolution fitness profiling. The RF index for individual point mutations across the genome was computed. The y-axis represents the natural log of the RF index, which is the ratio of the occurrence frequency in the passaged library to the occurrence frequency in the plasmid library. Each nucleotide position is represented by four consecutive lines for the RF index that correspond to mutating to A (blue), T (green), C (orange), or G (red). The RF index of WT nucleotides is set as zero. Only point mutations with a coverage of ≥ 30 tag-conflated reads in the plasmid library are shown. Point mutations with < 30 tag-conflated reads in the plasmid library are plotted as gray dots on the zero baseline. The data track for HA is adapted from Wu et al. [9].

Figure 2. Experimental validation of severely attenuated and neutral mutations. Based on the data in Fig. 1, mutations that displayed an RF index of < 0.05 were classified as severely attenuated and those with an RF index of > 0.4 were classified as neutral. Individual mutants were constructed and compared to the wild type (WT) replication phenotype. Post-transfection titers were plotted for lethal and viable mutants. Infection was initiated at an MOI of 0.05. Virus was harvested at 24 hours post-infection. Of the tested mutations with an RF index < 0.05, 68% had at least a 1-log decrease in titer compared to WT. Of the tested mutations with an RF index > 0.4, 94% had a titer within a 2-fold change of WT. Overall, the validation rate is ~80%.

Figure 3. Structural analysis of the NA homotetramer interface. The RF indices of the least destructive missense mutations for individual amino acids on the NA segment were projected onto the protein structure (PDB: 3CL0) to identify essential regions [28]. The RF index is color coded: RF index < 0.1, red; 0.1 ≤ RF index < 0.2, orange; uncovered, grey. Only one monomer of the homotetramer is color coded.

Figure 4. Structural analysis of the RNA polymerase PA subunit. The RF indices of the least destructive missense mutations in the profiling data for individual amino acids on the PA segment are projected onto the PA-PB1 complex crystal structure (PDB: 2ZNL) [29]. Most deleterious 10%, red; 10% to 20%, orange; others, green. Our fitness data can identify several critical interactions and putative functional sites. (D) A hydrogen bond between PA E617 and PB1 K11 is shown. Substitution of PA E617 is deleterious in our fitness data. (E) A hydrophobic interaction is shown between PA L666 and F710 and PB1. Substitution of L666 is deleterious in our fitness data. (F) A cluster of eight essential residues on the surface of PA is shown.

Supplemental Figure 1. Distribution of conflated cluster size. Reads from the same amplicon with the same tag were defined as a cluster. The counts (number of reads) for all clusters are displayed as a histogram. Individual molecules, each carrying a unique tag, have an average copy number of ~10 in the sequencing data, thus validating the sequencing library preparation design.

Supplemental Figure 2. Comparison with BLOSUM62-based amino acid conservation. RF indices of missense mutations from different segments were extracted and compared to amino acid conservation. The degree of amino acid conservation was quantified by the BLOSUM62 matrix, a substitution matrix based on an implicit model of evolution. The x-axis represents the different cutoffs for BLOSUM62 values. The average RF index for missense mutations that satisfied the cutoff was plotted against the BLOSUM62 cutoff values. The positive correlation between the RF index and the degree of amino acid conservation of missense mutations indicates that our fitness data are consistent with the evolutionary trend for missense mutations.

Supplemental Figure 3. The RF index of substitutions at different functional sites. (A) E339 and R416 on the NP protein form a salt bridge at the homodimer interface, which is essential for viral replication [11]. This suggests that it is a feasible drug target. Several small molecules have been identified to target this interface and inhibit viral replication. (B) T271A has been identified as a replication enhancement substitution on PB2. T271A virus showed enhanced growth as compared to the WT strain in mammalian cells in vitro [12]. (C) NA H259Y (N1 naming: H274Y), a known oseltamivir resistance substitution, was shown to present a strongly attenuated phenotype in WSN [13]. In contrast, H259N (N1 naming: H274N) did not impose a deleterious effect in our fitness profiling data. This substitution is hypothesized to reduce influenza zanamivir sensitivity. Our results suggest that further characterization of this substitution is warranted. (D) L26I, L26F, V27A and S31N on M2, the amantadine/rimantadine resistance substitutions [14,15], were shown to impose little effect on viral replication. Our data are consistent with the observation that resistance substitutions emerged rapidly during amantadine/rimantadine drug treatment [5]. The green dotted line represents the average RF index for missense mutations in the indicated segment. Overall, the fitness data were consistent with the phenotypes of functional mutants reported in the literature.

Supplemental Figure 4. Structural analysis of M1. (A) The RF indices of the least destructive missense mutations for individual amino acids on the M1 segment were projected onto the protein structure (PDB: 1EA3) to identify indispensable regions [30]. The RF index was color coded: RF index < 0.1, red; 0.1 ≤ RF index < 0.2, orange. (B) The critical residues 76RRR78 are displayed in stick format as an inset. It has been suggested that this basic amino acid stretch is important for virus assembly and/or budding [16]. Viruses with substitutions at these positions show an attenuated phenotype. Our data are consistent with the previous observation. The non-structural region at the C-terminal end of 76RRR78 is also indispensable in our profiling data. This suggests that the entire non-structural region containing the 76RRR78 basic stretch is functionally important. One possible explanation is that it provides an interface for a protein-protein interaction.

Supplemental Figure 5. Structural analysis of the PA endonuclease domain. The RF indices of the least destructive missense mutations in the profiling data for individual amino acids on the PA segment are projected onto the PA endonuclease crystal structure (PDB: 4E5G). Most deleterious 10%, red; 10% to 20%, orange; others, green. A critical helix-helix interface, which consists of T40, V44, M47, I171, R174 and I178, is highlighted. It demonstrates the power of qHRG in identifying residues that are not contiguous in the primary sequence.

Supplemental Figure 6. Structural analysis of the M2 ion channel. The RF indices of the least destructive missense mutations in the profiling data for individual amino acids on the M2 protein are projected onto the M2 ion channel crystal structure (PDB: 2RLF) [31]. Most deleterious 10%, red; 10% to 20%, orange; others, green. An indispensable region on the transmembrane helix is highlighted. Our data captured the essential amino acids W41 and H37, which are critical for M2 ion channel activation [31]. We also identified several adjacent hydrophobic residues, I35, L36, and L38, as critical residues, which can be attributed to their contact with the hydrophobic membrane.